Method for Analyzing Mass Spectra

RELATED APPLICATIONS 

This application claims the benefit of U.S. Provisional Patent
Application Nos. 60/249,835 filed November 16, 2000 and 60/254,746 filed
December 11, 2000. These U.S. Provisional Patent Applications are herein
incorporated by reference in their entirety for all purposes.

FIELD OF THE INVENTION 

Embodiments of the invention relate to methods for analyzing mass
spectra.

BACKGROUND OF THE INVENTION 

Recent advances in genomics research have led to the identification of
numerous genes associated with various diseases. However, while genomics research
can identify genes associated with a genetic predisposition to disease, there is still a
need to characterize and identify markers such as proteins. A "marker" typically
refers to a polypeptide or some other molecule that differentiates one biological status
from another. Proteins and other markers are important factors in disease states. For
example, proteins can vary in association with changes in biological states such as
disease. They can also signal cellular responses to disease, toxicity, or other stimuli.
When disease strikes, some proteins become dormant, while others become active.
Prostate Specific Antigen (PSA), for example, is a circulating serum protein that,
when elevated, correlates with prostate cancer. If the changes in protein levels could
be rapidly detected, physicians could diagnose diseases early and improve treatments.

Identifying novel markers is one of the earliest and most difficult steps
in the diagnostics and drug discovery processes. One way to discover if substances
are markers for a disease is by determining if they are "differentially expressed" in
biological samples from patients exhibiting the disease as compared to samples from
patients not having the disease. For example, FIG. 1(a) shows one graph 100 of a
plurality of overlaid mass spectra of samples from a group of 18 diseased patients.
The diseased patients could have, for example, prostate cancer. Another graph 102 is
shown in FIG. 1(b) and illustrates a plurality of overlaid mass spectra of samples from
a group of 18 normal patients. In each of the graphs 100, 102, signal intensity is
plotted as a function of mass-to-charge ratio. The intensities of the signals shown in
the graphs 100, 102 are proportional to the concentrations of markers having a
molecular weight related to the mass-to-charge ratio A in the samples. As shown in
the graphs 100, 102, at the mass-to-charge ratio A, a number of signals are present in
both pluralities of mass spectra. The signals include peaks that represent potential
markers having molecular weights related to the mass-to-charge ratio A.

When the signals in the graphs 100, 102 are viewed collectively, it is
apparent that the average intensity of the signals at the mass-to-charge ratio A is
higher in the samples from diseased patients than the samples from the normal
patients. The marker at the mass-to-charge ratio A is said to be "differentially
expressed" in diseased patients, because the concentration of this marker is, on
average, greater in samples from diseased patients than in samples from normal
patients.

In view of the data shown in FIGS. 1(a) and 1(b), it can be generally
concluded that the samples from diseased patients have a greater concentration of the
marker with the mass-to-charge ratio A than the samples from normal patients. Since
the concentration of the marker is generally greater in samples from diseased patients
than in the normal samples, the marker can also be characterized as being
"up-regulated" for the disease. If the concentration of the marker were generally less
in the samples from diseased patients than in the samples from normal patients, the
protein could be characterized as being "down-regulated".
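
The comparison of average intensities described above can be sketched in a few
lines of Python. This is only an illustration under stated assumptions: the
function name, the sample values, and the optional `tolerance` dead band are
hypothetical and do not come from the specification.

```python
from statistics import mean


def regulation(diseased, normal, tolerance=0.0):
    """Label a marker by comparing average signal intensity between groups.

    `diseased` and `normal` hold signal intensities at one mass-to-charge
    ratio (e.g., ratio A). `tolerance` is a hypothetical dead band that is
    not part of the source text.
    """
    diff = mean(diseased) - mean(normal)
    if diff > tolerance:
        return "up-regulated"
    if diff < -tolerance:
        return "down-regulated"
    return "no clear difference"


# Made-up intensities at mass-to-charge ratio A, for illustration only.
diseased_at_A = [8.0, 9.5, 7.5, 10.0]
normal_at_A = [3.0, 4.5, 2.5, 4.0]
label = regulation(diseased_at_A, normal_at_A)
```

With real spectra, a statistical test would normally replace the bare
difference of means; the sketch only mirrors the averaging argument in the text.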

WO 03/031031                                                  PCT/US01/44972

Once markers are discovered, they can be used as diagnostic tools. For
example, with reference to the example described above, an unknown sample from a
test patient may be analyzed using a mass spectrometer and a mass spectrum can be
generated. The mass spectrum can be analyzed and the intensity of a signal at the
mass-to-charge ratio A can be determined in the test patient's mass spectrum. The
signal intensity can be compared to the average signal intensities at the
mass-to-charge ratio A for diseased patients and normal patients. A prediction can
then be made as to whether the unknown sample indicates that the test patient has or
will develop cancer. For example, if the signal intensity at the mass-to-charge ratio A
in the unknown sample is much closer to the average signal intensity at the
mass-to-charge ratio A for the diseased patient spectra than for the normal patient
spectra, then a prediction can be made that the test patient is more likely than not to
develop or have the disease.
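
The single-marker prediction just described amounts to a nearest-mean rule. A
minimal sketch follows, assuming made-up intensities and a hypothetical
function name; ties are arbitrarily resolved as "normal", which is a choice of
this sketch and not something the text specifies.

```python
from statistics import mean


def predict_status(test_intensity, diseased, normal):
    """Assign the test sample to the class whose average intensity is closer.

    All three arguments refer to signal intensity at the same
    mass-to-charge ratio.
    """
    dist_diseased = abs(test_intensity - mean(diseased))
    dist_normal = abs(test_intensity - mean(normal))
    return "diseased" if dist_diseased < dist_normal else "normal"


# Made-up reference intensities at mass-to-charge ratio A.
diseased_ref = [8.0, 9.0, 10.0]
normal_ref = [2.0, 3.0, 4.0]
prediction = predict_status(8.5, diseased_ref, normal_ref)
```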

While the described differential expression analysis is useful, many
improvements could be made. For instance, analyzing the amount of a single marker
such as PSA in a patient's biological sample is often not sufficiently reliable to
monitor disease processes. PSA is considered to be one of the best prostate cancer
markers presently available. However, it does not always correctly differentiate
benign from malignant prostate disease. While the concentration of a marker such as
PSA in a biological sample provides some ability to predict whether a test patient has
a disease, an analytical method with a greater degree of reliability is desirable.

Also, when a large number of mass spectra of a large number of
biological samples are analyzed, it is not readily apparent which signals represent
markers that might differentiate between a diseased state and a non-diseased state. A
typical mass spectrum of a biological sample has numerous potential marker signals
(e.g., greater than 200) and a significant amount of noise. This can make the
identification of potentially significant signals and the identification of average signal
differentials difficult. Consequently, it is difficult to identify and quantify potential
markers. Unless the potential markers exhibit strong up-regulation or strong
down-regulation, the average signal differential between samples from diseased
patients and samples from normal patients may not be easily discernible. For
example, it is often difficult to visually determine that a cluster of signals at a given
mass value in one group of mass spectra has higher or lower average signal intensity
than a cluster of signals from another group of mass spectra. In addition, many
potentially significant signals may have low intensity values. The noise in the spectra
may obscure many of these potentially significant signals. The signals may go
undiscovered and may be inadvertently omitted from a differential expression
analysis.

It would be desirable to have better ways to analyze mass spectra. For 
example, it would be desirable to provide for a more accurate method for discovering 
potentially useful markers. It would also be desirable to provide an improved 
classification model that can be used to predict whether an unknown sample is 
associated or is not associated with a particular biological status. 

Embodiments of the invention address these and other problems.

SUMMARY OF THE INVENTION

Embodiments of the invention relate to methods for analyzing mass
spectra. In embodiments of the invention, a digital computer forms a classification
model that can be used to differentiate classes of samples associated with different
biological statuses. The classification model can be used as a diagnostic tool for
prediction. It may also be used to identify potential markers associated with a
biological status. In addition, the classification model can be formed using a process
such as, for example, a neural network analysis.

One embodiment of the invention is directed to a method that analyzes
mass spectra using a digital computer. The method comprises: a) entering into a digital
computer a data set obtained from mass spectra from a plurality of samples, wherein
each sample is, or is to be assigned to a class within a class set comprising two or
more classes, each class characterized by a different biological status, and wherein
each mass spectrum comprises data representing signal strength as a function of
mass-to-charge ratio or a value derived from mass-to-charge ratio, and is formed
using a laser desorption ionization process; and b) forming a classification model
which discriminates between the classes in the class set, wherein forming comprises
analyzing the data set by executing code that embodies a classification process.

Another embodiment of the invention is directed to a method that
analyzes mass spectra using a digital computer. The method comprises: a) entering
into a digital computer a data set obtained from mass spectra from a plurality of
samples, wherein each sample is, or is to be assigned to a class within a class set
comprising two or more classes, each class characterized by a different biological
status, and wherein each mass spectrum comprises data representing signal strength as
a function of time-of-flight or a value derived from time-of-flight, and is formed using
a laser desorption ionization process; and b) forming a classification model which
discriminates between the classes in the class set, wherein forming comprises
analyzing the data set by executing code embodying a classification process.

Another embodiment is directed to a computer readable medium. The
computer readable medium comprises: a) code for entering data derived from mass
spectra from a plurality of samples, wherein each sample is, or is to be assigned to a
class within a class set of two or more classes, each class characterized by a different
biological status, and wherein each mass spectrum comprises data representing signal
strength as a function of time-of-flight or a value derived from time-of-flight, or
mass-to-charge ratio or a value derived from mass-to-charge ratio, and is formed using a
laser desorption ionization process; and b) code for forming a classification model
using a classification process, wherein the classification model discriminates
between the classes in the class set.

Another embodiment of the invention is directed to a method for
classifying an unknown sample into a class characterized by a biological status using
a digital computer. The method comprises: a) entering data obtained from a mass
spectrum of the unknown sample into a digital computer; and b) processing the mass
spectrum data using a classification model to classify the unknown sample in a class
characterized by a biological status. The classification model may be formed using,
for example, a neural network analysis.

Another embodiment of the invention is directed to a method for
estimating the likelihood that an unknown sample is accurately classified as belonging
to a class characterized by a biological status using a digital computer. The method
comprises: a) entering data obtained from a mass spectrum of the unknown sample
into a digital computer; and b) processing the mass spectrum data using a
classification model to estimate the likelihood that the unknown sample is accurately
classified into a class characterized by a biological status. The classification model
may be formed using a classification process, and is formed using a data set obtained
from mass spectra of samples assigned to two or more classes with different
biological statuses.

In embodiments of the invention, the mass spectra being analyzed may
be pre-existing mass spectra which, for example, may have been created well before
the classification model is formed. Alternatively, the mass spectra data may have
been created substantially contemporaneously with the formation of the classification
model.

These and other embodiments of the invention are described with
reference to the Figures and the Detailed Description.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1(A) shows overlaid mass spectra for samples from diseased
patients.

FIG. 1(B) shows overlaid mass spectra for samples from normal
patients.

FIG. 2 illustrates a flowchart of a method for creating mass spectra
according to an embodiment of the invention.

FIG. 3 shows a graph of log normalized intensity as a function of
identified peak clusters. The signal intensities from mass spectra from two different
groups of samples are shown in the graph.

FIG. 4 shows a flowchart illustrating some preferred mass spectra
preprocessing procedures according to an embodiment of the invention.

FIG. 5 shows a flowchart illustrating some preferred mass spectra
preprocessing procedures and classification model formation procedures according to
an embodiment of the invention.

FIG. 6 shows a block diagram of a system according to an embodiment
of the invention.

FIG. 7 shows a classification and regression tree according to an
embodiment of the invention.

FIG. 8 shows a table showing the variable importance of different
predictor variables.

FIG. 9 shows gel views obtained from different samples from cancer
patients and normal patients.

FIG. 10 shows spectral views obtained from different samples from
cancer and normal patients.

DETAILED DESCRIPTION 

In embodiments of the invention, a data set obtained from mass spectra
is entered into a digital computer to form a classification model. The mass spectra are
preferably obtained from biological samples having known characteristics. In
preferred embodiments, the data set used to form the classification model is
characterized as a "known" data set, because the biological statuses associated with
the biological samples are known before the data set is used to form the classification
model. In comparison, an "unknown" data set includes data that is obtained from
mass spectra of samples where it is unclear if the samples are associated with the
biological statuses which are discriminated by the classification model when the mass
spectra are formed. Unknown data may be derived from a biological sample from a
test patient who is to be diagnosed using the classification model. In some
environments, the known data set is referred to as "training data".

For purposes of illustration, many of the examples described below
refer to using a known data set to form a classification model. However, in some
embodiments of the invention, the data set used to form the classification model may
be an unknown data set. For example, in a cluster analysis, mass spectra of unknown
biological samples may be grouped together if they have similar patterns. Samples
corresponding to each group may be analyzed to see if they have a biological status in
common. If so, then the samples in the group may be assigned to a class associated
with the biological status. For example, after forming a group of mass spectra having
common patterns, it may be determined that all spectra in the group were obtained
from biological samples that were all exposed to radiation. The samples in the group
may then be assigned to a class that is associated with the status "radiation exposed".
Samples in other groupings can be assigned to classes characterized by other
biological statuses common to the samples in the respective groupings. A
classification model can thus be formed and unknown spectra may be classified using
the formed classification model.
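
The grouping-then-labeling idea in the preceding paragraph can be illustrated
with a deliberately simple sketch. The greedy distance-threshold grouping below
is a stand-in chosen for brevity, not the cluster analysis the specification
has in mind; the function names, the threshold, and all data values are
hypothetical.

```python
import math


def group_spectra(spectra, max_distance):
    """Greedy grouping: a spectrum joins the first group whose first member
    lies within `max_distance` (Euclidean); otherwise it starts a new group.
    Returns a list of groups, each a list of sample indices."""
    groups = []
    for idx, spectrum in enumerate(spectra):
        for group in groups:
            if math.dist(spectrum, spectra[group[0]]) <= max_distance:
                group.append(idx)
                break
        else:
            groups.append([idx])
    return groups


def shared_status(group, status_by_sample):
    """Return the biological status common to every sample in the group,
    or None when the samples do not share a single status."""
    statuses = {status_by_sample[i] for i in group}
    return statuses.pop() if len(statuses) == 1 else None


# Made-up intensity vectors sampled at three mass-to-charge ratios.
spectra = [[1.0, 5.0, 0.5], [1.2, 4.8, 0.4], [4.0, 0.5, 3.0], [3.8, 0.7, 3.2]]
groups = group_spectra(spectra, max_distance=1.0)

# Statuses looked up only after grouping, as in the radiation example.
status_by_sample = {0: "radiation exposed", 1: "radiation exposed",
                    2: "not exposed", 3: "not exposed"}
class_labels = [shared_status(g, status_by_sample) for g in groups]
```

The result of this greedy scheme depends on sample order and on the threshold,
which is one reason a production analysis would use an established clustering
method instead.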

In embodiments of the invention, each sample used is, or is to be
assigned to a class of a set of two or more classes, and each class is characterized by a
different biological status. For example, a first class of samples may be associated
with a biological status such as a diseased state. A second class of mass spectra of
samples may be associated with a biological status such as a non-diseased state. The
samples in the first and second classes may form the class set. The mass spectra from
each of the respective classes can contain data that differentiates the first and the
second classes.

In embodiments of the invention, each mass spectrum in the analyzed
mass spectra could comprise signal strength data as a function of time-of-flight, a
value derived from time-of-flight (e.g., mass-to-charge ratio, molecular weight, etc.),
mass-to-charge ratio, or a value derived from mass-to-charge ratio (e.g., molecular
weight). As known by those of ordinary skill in the art, mass-to-charge ratio values
obtained from a time-of-flight mass spectrometer are derived from time-of-flight
values. Mass-to-charge ratios may be obtained in other ways. For example, instead
of using a time-of-flight mass spectrometer to determine mass-to-charge ratios, mass
spectrometers using quadrupole analyzers and magnetic mass analyzers can be used to
determine mass-to-charge ratios.

In preferred embodiments, each mass spectrum comprises signal
strength data as a function of mass-to-charge ratio. In a typical spectral view-type
mass spectrum, the signal strength data may be in the form of "peaks" on a graph of
signal intensity as a function of mass-to-charge ratio. Each peak may have a base and
an apex, where peak width narrows from the base to the apex. The mass-to-charge
ratio generally associated with the peak corresponds to the apex of the peak. The
intensity of the peak is also generally associated with the apex of the peak.

Generally, the mass-to-charge ratio relates to the molecular weight of a 
potential marker. For example, if a potential marker has a charge of +1, then the 
mass-to-charge ratio is equal to the molecular weight of the potential marker 
represented by the signal. Thus, while some mass spectra plots may show signal
intensity as a function of molecular weight, the molecular weight parameter is in fact
derived from mass-to-charge ratios.

While many specific embodiments of the invention discussed herein
refer to the use of mass-to-charge ratios, it is understood that time-of-flight values, or
other values derived from time-of-flight values, may be used in place of
mass-to-charge ratio values in any of the specifically discussed exemplary
embodiments.

Although each mass spectrum in the analyzed mass spectra can
comprise signal strength data as a function of time-of-flight, the use of mass spectra
having signal strength data as a function of mass-to-charge ratio is generally
preferred. Time-of-flight values for ions are machine dependent, whereas
mass-to-charge ratio values are machine independent. For example, in a
time-of-flight mass spectrometry process, the time-of-flight values obtained for ions
can depend on the length of the free flight tube in the particular mass spectrometer
used. Different mass spectrometers with different free flight tube lengths can produce
different time-of-flight values for the same ion. This is not the case for
mass-to-charge ratios, since a mass-to-charge ratio is simply the ratio of the mass of
an ion to the charge of the ion. Classification models created using mass-to-charge
ratio values can also be independent of the particular mass spectrometer used to create
them.
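
The machine dependence of time-of-flight can be made concrete with the
idealized textbook relation for a linear time-of-flight instrument,
t = L * sqrt(m / (2 z e V)). The relation, the tube lengths, and the
accelerating voltage below are assumptions introduced for illustration; the
specification gives no formula, and real instruments rely on calibration
rather than this ideal model.

```python
import math

E_CHARGE = 1.602176634e-19   # elementary charge in coulombs
DALTON = 1.66053906660e-27   # unified atomic mass unit in kilograms


def time_of_flight(mz, tube_length_m, accel_voltage):
    """Ideal flight time in seconds for an ion with mass-to-charge ratio
    `mz` (daltons per elementary charge) in a field-free tube."""
    return tube_length_m * math.sqrt(mz * DALTON / (2 * E_CHARGE * accel_voltage))


def mass_to_charge(flight_time_s, tube_length_m, accel_voltage):
    """Invert the ideal relation to recover mass-to-charge ratio."""
    return (2 * E_CHARGE * accel_voltage * flight_time_s ** 2
            / (tube_length_m ** 2 * DALTON))


# The same ion (m/z = 10000) in two hypothetical instruments.
t_short = time_of_flight(10000.0, tube_length_m=1.0, accel_voltage=20000.0)
t_long = time_of_flight(10000.0, tube_length_m=2.0, accel_voltage=20000.0)

# Flight times differ, but both convert back to the same m/z.
mz_short = mass_to_charge(t_short, 1.0, 20000.0)
mz_long = mass_to_charge(t_long, 2.0, 20000.0)
```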

The data set may comprise any suitable data and may be entered
automatically or manually into a digital computer. The data may be raw or
preprocessed before being processed by the classification process run on the digital
computer. For example, the raw intensities of signals at predetermined
mass-to-charge ratios in the mass spectra may be used as the data set. Alternatively,
the raw data may be preprocessed before the classification model is formed. For
example, in some embodiments, the log values of the intensities (e.g., base 2) of the
signals in the mass spectra may be used to form the data set.
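
As a small illustration of the base-2 log preprocessing mentioned above, the
following sketch converts raw intensities to log values. The `floor` argument,
which guards against taking the log of zero or negative readings, is an
assumption of this sketch and is not described in the text.

```python
import math


def log2_intensities(raw, floor=1e-6):
    """Return base-2 log values of raw signal intensities.

    Values below `floor` (a hypothetical guard) are clamped so that the
    logarithm is always defined.
    """
    return [math.log2(max(value, floor)) for value in raw]


# Made-up raw intensities at four predetermined mass-to-charge ratios.
raw = [8.0, 2.0, 0.5, 1.0]
log_values = log2_intensities(raw)
```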

The data set is entered into the digital computer. Computer code that
embodies a classification process uses the data set to form a classification model.
Exemplary classification processes include hierarchical classification processes such 
as a classification and regression tree process, multivariate statistical analyses such as 
a cluster analysis, and non-linear processes such as a neural network analysis. In 
preferred embodiments, the data set is processed using a classification and regression 
tree process to produce a classification model such as a classification and regression 
tree. These and other classification processes and classification models are described 
in greater detail below. 
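
To give a feel for how a classification and regression tree process chooses a
differentiating feature, the sketch below searches for the single best split of
a tiny data set using Gini impurity. It is a one-level simplification written
for illustration, not the process used in the specification; the function
names and data are hypothetical, and a full tree would apply the same search
recursively to each side of the split.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure group)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_split(rows, labels):
    """Return (feature_index, threshold, weighted_gini) for the single split
    that minimizes the weighted Gini impurity of the two resulting groups."""
    best = None
    n = len(rows)
    for feature in range(len(rows[0])):
        values = sorted({row[feature] for row in rows})
        # Candidate thresholds are midpoints between adjacent distinct values.
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [lab for row, lab in zip(rows, labels) if row[feature] <= threshold]
            right = [lab for row, lab in zip(rows, labels) if row[feature] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (feature, threshold, score)
    return best


# Made-up log intensities at two mass-to-charge ratios for four samples.
rows = [[3.0, 1.0], [2.5, 4.0], [2.6, 1.2], [3.1, 4.5]]
labels = ["normal", "diseased", "normal", "diseased"]
split = best_split(rows, labels)
```

Here the second feature separates the two classes cleanly, so the search
selects it; the first feature cannot do so at any threshold.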

The created classification model may be predictive or descriptive. For
example, the model can be used to predict whether an unknown test biological sample
is or is not associated with a particular biological status. Alternatively or additionally,
the classification model may be interrogated to identify features in the data that
differentiate the biological status(es) being analyzed. A feature includes any aspect of
the mass spectra data that can differentiate the particular classes being analyzed.
Suitable features that can be identified include, but are not limited to, signal
intensities or signal intensity ranges at one or more mass-to-charge ratios, signal
shapes (e.g., peak shapes), signal areas (e.g., peak areas), signal widths (e.g., peak
widths such as at the bottom of a peak), the number of signals in each mass spectrum,
etc. In a typical example, the classification model may indicate that a feature such as
a particular signal intensity at a given mass-to-charge ratio differentiates diseased
samples from non-diseased samples. In yet another example, the classification model
may indicate that a combination of features differentiates diseased samples from
non-diseased samples. For example, signal intensity ranges for two or more signals at
different mass-to-charge ratios may differentiate a diseased state from a non-diseased
state.

In another example, a suitable feature that may be identified as
differentiating the different sample classes may be the frequency that signals occur at
a particular mass-to-charge ratio within a class. For example, for a diseased class
having 100 samples and a normal class having 100 samples, a signal of intensity Y at
a mass-to-charge ratio X may be present in the mass spectra of 90 diseased class
samples, but may be present in only 10 samples from the normal class.
Even though the average intensity of the signals is the same in both the diseased class
and the normal class (i.e., an average intensity of Y), the higher number of
occurrences of the signal in the diseased class indicates that the feature
differentiates the diseased class from the normal class. A frequency feature such as
this can be identified using the classification model.
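
The 90-versus-10 example can be reproduced with a short sketch that counts how
often a signal is present in each class. The detection cutoff `min_intensity`
and the numeric value chosen for Y are assumptions made for illustration only.

```python
from statistics import mean


def occurrence_rate(intensities, min_intensity):
    """Fraction of samples in a class whose signal at a given
    mass-to-charge ratio reaches `min_intensity` (a hypothetical cutoff)."""
    present = sum(1 for value in intensities if value >= min_intensity)
    return present / len(intensities)


Y = 5.0  # made-up intensity of the signal when it is present

# 100 samples per class; 0.0 marks spectra in which the signal is absent.
diseased = [Y] * 90 + [0.0] * 10
normal = [Y] * 10 + [0.0] * 90

rate_diseased = occurrence_rate(diseased, min_intensity=1.0)
rate_normal = occurrence_rate(normal, min_intensity=1.0)

# Among spectra where the signal is present, the average intensity is
# identical, so only the frequency separates the classes.
avg_present_diseased = mean(v for v in diseased if v >= 1.0)
avg_present_normal = mean(v for v in normal if v >= 1.0)
```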

Any suitable biological samples may be used in embodiments of the
invention. Biological samples include tissue (e.g., from biopsies), blood, serum,
plasma, nipple aspirate, urine, tears, saliva, cells, soft and hard tissues, organs, semen,
feces, and the like. The biological samples may be obtained from any suitable
organism including eukaryotic, prokaryotic, or viral organisms.

The biological samples may include biological molecules including
macromolecules such as polypeptides, proteins, nucleic acids, enzymes, DNA, RNA,
polynucleotides, oligonucleotides, carbohydrates, oligosaccharides,
polysaccharides; fragments of biological macromolecules set forth above, such as
nucleic acid fragments, peptide fragments, and protein fragments; complexes of
biological macromolecules set forth above, such as nucleic acid complexes,
protein-DNA complexes, receptor-ligand complexes, enzyme-substrate, enzyme
inhibitors, peptide complexes, protein complexes, carbohydrate complexes, and
polysaccharide complexes; small biological molecules such as amino acids,
nucleotides, nucleosides, sugars, steroids, lipids, metal ions, drugs, hormones, amides,
amines, carboxylic acids, vitamins and coenzymes, alcohols, aldehydes, ketones, fatty
acids, porphyrins, carotenoids, plant growth regulators, phosphate esters and
nucleoside diphospho-sugars; synthetic small molecules such as pharmaceutically or
therapeutically effective agents, monomers, peptide analogs, steroid analogs,
inhibitors, mutagens, carcinogens, antimitotic drugs, antibiotics, ionophores,
antimetabolites, amino acid analogs, antibacterial agents, transport inhibitors,
surface-active agents (surfactants), mitochondrial and chloroplast function inhibitors,
electron donors, carriers and acceptors, synthetic substrates for proteases, substrates
for phosphatases, substrates for esterases and lipases and protein modification
reagents; and synthetic polymers, oligomers, and copolymers. Any suitable mixture
or combination of the substances specifically recited above may also be included in
the biological samples.

As noted above, the biological samples from which the data set is
created are assigned to a class in a set of two or more classes. Each class is
characterized by a different biological status. Preferably, there are only two classes
and two biological statuses, one for each of the two classes. For example, one class
may have a biological status such as a diseased state while the other class
may have a status such as a non-diseased state.

As used herein, "biological status" of a sample refers to any
characterizing feature of a biological state of the sample or the organism or source
from which the sample is derived. The feature can be a biological trait such as a
genotypic trait or a phenotypic trait. The feature can be a physiological or disease
trait, such as the presence or absence of a particular disease, including infectious
disease. The feature also can be a condition (environmental, social, psychological,
time-dependent, etc.) to which the sample has been exposed.

Genotypic traits can include the presence or absence of a particular
gene or polymorphic form of a gene, or combination of genes. Genetic traits may be
manifested as phenotypic traits or exist as susceptibilities to their manifestation, such
as a susceptibility to a particular disease (e.g., a propensity for certain types of cancer
or heart disease).

Phenotypic traits include, for example, appearance, physiological
traits, physical traits, neurological conditions, psychiatric conditions, response traits,
e.g., response or lack of response to a particular drug. Phenotypic traits can include
the presence or absence of so-called "normal" or "pathological" traits, including
disease traits. Another status is the presence or absence of a particular disease. A
status also can be the status of belonging to a particular person or group such as
different individuals, different families, different age states, different species, and
different tissue types.

In some embodiments, the biological statuses may be, for example, one
or more of the following in any suitable combination: a diseased state, a normal
status, a pathological status, a drug state, a non-drug state, a drug responder state, a
non-drug responder state, and a benign state. A drug state may include a state where
a patient has taken a drug, while a non-drug state may include a state where a
patient has not taken a drug. A drug responder state is a state of a biological sample
in response to the use of a drug. Specific examples of disease states include, e.g.,
cancer, heart disease, autoimmune disease, viral infection, Alzheimer's disease and
diabetes. More specific cancer statuses include, e.g., prostate cancer, bladder cancer,
breast cancer, colon cancer, and ovarian cancer. Biological statuses may also include
beginning states, intermediate states, and terminal states. For example, different
biological statuses may include the beginning state, the intermediate state, and the
terminal state of a disease such as cancer.

Other statuses may be associated with different environments to which
different classes of samples are subjected. Illustrative environments include one or
more conditions such as treatment by exposure to heat, electromagnetic radiation,
exercise, diet, geographic location, etc. For example, a class of biological samples
(e.g., all blood samples) may be from a group of patients who have been exposed to
radiation and another class of biological samples may be from a group of patients who
have not been exposed to radiation. The radiation source may be an intended
radiation source such as an x-ray machine or may be an unintended radiation source
such as a cellular phone. In another example, one group of persons may have been on
a particular diet of food, while another group may have been on a different diet.

In other embodiments of the invention, the different biological statuses
may correspond to samples that are associated with respectively different drugs or
drug types. In an illustrative example, mass spectra of samples from persons who
were treated with a drug of known effect are created. The mass spectra associated
with the drug of known effect may represent drugs of the same type as the drug of
known effect. For instance, the mass spectra associated with drugs of known effect
may represent drugs with the same or similar characteristics, structure, or the same
basic effect as the drug of known effect. Many different analgesic compounds, for
example, may all provide pain relief to a person. The drug of known effect and drugs
of the same or similar type might all regulate the same biochemical pathway in a
person to produce the same effect on a person. Characteristics of the biological
pathway (e.g., up- or down-regulated proteins) may be reflected in the mass spectra.

A classification model can be created using the mass spectra associated
with the drug of known effect and mass spectra associated with different drugs,
different drug types, or no drug at all. Once the classification model is created, a
mass spectrum can then be created for a candidate sample associated with a candidate
drug of unknown effect. Using the classification model, the mass spectrum associated
with the candidate sample is classified. The classification model can determine if the
candidate sample is associated with the drug of known effect or another drug of a
different type. If, for example, the classification model classifies the candidate
sample as being associated with the drug of known effect, then the candidate drug is
likely to have the same effect on a person as the drug of known effect. Accordingly,
embodiments of the invention can be used, among other things, to discover and/or
characterize drugs.

I. Obtaining Mass Spectra 

The mass spectra may be obtained by any suitable process. For
example, the mass spectra may be retrieved (e.g., downloaded) from a local or remote
server computer having access to one or more databases of mass spectra. The
databases may contain libraries of mass spectra of different biological samples
associated with different biological statuses. Alternatively, the mass spectra may be
generated from the biological samples. Regardless of how they are obtained, the mass
spectra and the samples used to create the classification model are preferably
processed under similar conditions to ensure that any changes in the spectra are due to
the samples themselves, and not differences in processing. The mass spectra might be
created specifically with a particular classification process in mind, or might be
created without reference to a particular classification process used on the data.

In embodiments of the invention, a gas phase ion spectrometer 
may be used to create mass spectra. A "gas phase ion spectrometer" refers to an 
apparatus that measures a parameter that can be translated into mass-to-charge ratios 
of ions formed when a sample is ionized into the gas phase. This includes, e.g., mass 
spectrometers, ion mobility spectrometers, or total ion current measuring devices. 

The mass spectrometer may use any suitable ionization technique. The 
ionization techniques may include, for example, electron ionization, fast atom/ion 
bombardment, matrix-assisted laser desorption/ionization (MALDI), surface enhanced 
laser desorption/ionization (SELDI), or electrospray ionization. 

In some embodiments, an ion mobility spectrometer can be used to 
detect and characterize a marker. The principle of ion mobility spectrometry is based 
on the different mobility of ions. Specifically, ions of a sample produced by 
ionization move at different rates, due to their difference in, e.g., mass, charge, or 
shape, through a tube under the influence of an electric field. The ions (typically in 
the form of a current) are registered at a detector and the output of the detector can 
then be used to identify a marker or other substances in the sample. One advantage of 
ion mobility spectrometry is that it can be performed at atmospheric pressure. 

In preferred embodiments, a laser desorption time-of-flight mass 
spectrometer is used to create the mass spectra. Laser desorption spectrometry is 
especially suitable for analyzing high molecular weight substances such as proteins. 
For example, the practical mass range for a MALDI or a surface enhanced laser 
desorption/ionization process can be up to 300,000 daltons or more. Moreover, laser 
desorption processes can be used to analyze complex mixtures and have high 
sensitivity. In addition, the likelihood of protein fragmentation is lower in a laser 
desorption process such as a MALDI or a surface enhanced laser 
desorption/ionization process than in many other mass spectrometry processes. Thus, 
laser desorption processes can be used to accurately characterize and quantify high 
molecular weight substances such as proteins. 

In a typical process for creating a mass spectrum, a probe with a 
marker is introduced into an inlet system of the mass spectrometer. The marker is 
then ionized. After the marker ions are generated, the generated ions are collected by 
an ion optic assembly, and then a mass analyzer disperses and analyzes the passing 
ions. The ions exiting the mass analyzer are detected by a detector. In a 
time-of-flight mass analyzer, ions are accelerated through a short high voltage field 
and drift into a high vacuum chamber. At the far end of the high vacuum chamber, 
the accelerated ions strike a sensitive detector surface at different times. Since the 
time-of-flight of the ions is a function of the mass-to-charge ratio of the ions, the 
elapsed time between ionization and impact can be used to identify the presence or 
absence of molecules of specific mass-to-charge ratio. 
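
By way of illustration only, the dependence of flight time on mass-to-charge ratio can be sketched for an idealized linear time-of-flight analyzer, in which an ion of mass m and charge z gains all of its kinetic energy from an accelerating potential U and then drifts a length L. The function name, the 20 kV potential, and the 1 m drift length below are assumptions chosen for the example and are not taken from this description.

```python
import math

ELEMENTARY_CHARGE = 1.602176634e-19  # coulombs
DALTON = 1.66053906660e-27           # kilograms

def ideal_flight_time(mass_da, charge=1, accel_volts=20_000.0, drift_m=1.0):
    """Drift time in seconds for an idealized linear time-of-flight analyzer.

    Assumes z*e*U = (1/2)*m*v**2, so that t = L * sqrt(m / (2*z*e*U)).
    The default voltage and drift length are illustrative assumptions.
    """
    mass_kg = mass_da * DALTON
    energy_j = charge * ELEMENTARY_CHARGE * accel_volts
    velocity = math.sqrt(2.0 * energy_j / mass_kg)
    return drift_m / velocity

t_light = ideal_flight_time(10_000)   # singly charged 10,000 dalton ion
t_heavy = ideal_flight_time(40_000)   # singly charged 40,000 dalton ion
```

Under these assumptions the flight time scales with the square root of the mass-to-charge ratio, so an ion four times heavier at the same charge arrives twice as late.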

The time-of-flight data may then be converted into mass-to-charge 
ratios to generate a spectrum showing the signal strength of the markers as a function 
of mass-to-charge ratio. FIG. 2 shows a flowchart illustrating an exemplary method 
for converting mass spectra based on time-of-flight data into mass-to-charge ratio 
data. First, time-of-flight spectra are collected (step 16). Then, a smoothing filter is 
applied to the time-of-flight spectra (step 18). Typically, a significant amount of high 
frequency noise is present in the initially generated spectra. Various filters are 
applied to reduce noise without corrupting the underlying signal. Then, a baseline is 
calculated (step 20). This removes an upward shift that can be 
characteristic of, for example, a MALDI or a surface enhanced laser 
desorption/ionization process. 

"Surface enhanced" desorption/ionization processes refer to those 
processes in which the substrate on which the sample is presented to the energy 
source plays an active role in the desorption/ionization process. In these methods, the 
substrate, such as a probe, is not merely a passive stage for sample presentation. 
Several types of surface enhanced substrates can be employed in a surface enhanced 
desorption/ionization process. In one example, the surface comprises an affinity 
material, such as anion exchange groups or hydrophilic groups (e.g., silicon oxide), 
that preferentially bind certain classes of molecules. Examples of such affinity 
materials include, for example, silanol (hydrophilic), C8 or C16 alkyl (hydrophobic), 
immobilized metal chelate (coordinate covalent), anion or cation exchangers (ionic), 
or antibodies (biospecific). The sample is exposed to a substrate bound adsorbent so 
as to bind analyte molecules according to the particular basis of attraction. Typically, 
non-binding molecules are washed off. When the analytes are biomolecules, an 
energy absorbing material, e.g., matrix, is typically associated with the bound sample. 
Then a laser is used to desorb and ionize the analytes, which are detected with a 
detector. 

In another version, the substrate surface comprises a bound layer of 
energy absorbing molecules, obviating the need to mix the sample with a matrix 
material, as in MALDI. Surface enhanced desorption/ionization methods are 
described in, e.g., U.S. Patent 5,719,060 (Hutchens and Yip) and WO 98/59360 
(Hutchens and Yip) (U.S. Patent 6,255,047). When a laser desorbs a matrix including 
an energy absorbing material, some of the matrix material can also be desorbed along 
with the sample material being analyzed. The baseline calculation adjusts the spectra 
to take into account the presence of the signal due to desorbed matrix material. Once 
a baseline is calculated, a time-of-flight/mass transformation takes place (step 22). In 
this step, the time-of-flight data is converted into mass-to-charge ratios. Local noise 
values are then calculated (step 24). At low mass-to-charge ratios, a significant 
amount of noise is generated due to the desorbed matrix material. In an ionization 
desorption process, desorption of the matrix material is less likely at higher 
mass-to-charge ratios than at lower mass-to-charge ratios. Noise is therefore more 
likely at lower mass-to-charge ratios than at higher mass-to-charge ratios. 
Adjustments to the spectra can be made to correct for this effect. After these 
corrections are made, the spectra update is complete (step 26). By processing mass 
spectra according to the method shown in FIG. 2, the signal-to-noise ratio of the mass 
spectrum is improved, allowing better quantitation and comparison of potential 
markers. 
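
The sequence of steps 18 through 24 might be sketched as follows. This is a minimal illustration, not the method of FIG. 2 itself: the description does not specify the filter, the baseline estimator, the calibration equation, or the noise measure, so the moving average, rolling minimum, quadratic calibration m/z = a(t - t0)^2, and rolling median absolute deviation used here are all assumptions, as are the function names and window sizes.

```python
import numpy as np

def smooth(signal, window=5):
    """Moving-average filter to suppress high frequency noise (step 18)."""
    kernel = np.ones(window) / window
    return np.convolve(signal, kernel, mode="same")

def rolling_min_baseline(signal, window=51):
    """Estimate a slowly varying baseline as a rolling minimum (step 20)."""
    half = window // 2
    padded = np.pad(signal, half, mode="edge")
    return np.array([padded[i:i + window].min() for i in range(signal.size)])

def tof_to_mz(t, a, t0):
    """Assumed quadratic calibration m/z = a * (t - t0)**2 (step 22)."""
    return a * (t - t0) ** 2

def local_noise(signal, window=51):
    """Rolling median absolute deviation as a local noise value (step 24)."""
    half = window // 2
    padded = np.pad(signal, half, mode="edge")
    out = np.empty(signal.size)
    for i in range(signal.size):
        segment = padded[i:i + window]
        out[i] = np.median(np.abs(segment - np.median(segment)))
    return out

def process_spectrum(t, intensity, a=1.0, t0=0.0):
    """Apply smoothing, baseline subtraction, conversion, and noise estimation."""
    smoothed = smooth(np.asarray(intensity, dtype=float))
    corrected = smoothed - rolling_min_baseline(smoothed)
    mz = tof_to_mz(np.asarray(t, dtype=float), a, t0)
    noise = local_noise(corrected)
    return mz, corrected, noise
```

The calibration constants `a` and `t0` would in practice be fitted from signals of known mass; they are left as parameters here.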

Mass spectra data generated by the desorption and detection of markers 
can be preprocessed using a digital computer after or before generating a mass spectra 
plot. Data analysis can include the steps of determining the signal strength (e.g., 
height of signals) of a detected marker and removing "outliers" (data deviating from a 
predetermined statistical distribution). For example, the observed signals can be 
normalized. Normalization is a process whereby the height of each signal relative to 
some reference is calculated. For example, a reference can be background noise 
generated by instrument and chemicals (e.g., an energy absorbing molecule), which is 
set as zero in the scale. Then, the signal strength detected for each marker or other 
substances can be displayed in the form of relative intensities in the scale desired 
(e.g., 100). Alternatively, a standard may be admitted with the sample so that a signal 
from the standard can be used as a reference to calculate relative intensities of the 
signals observed for each marker or other markers detected. 

The digital computer can transform the resulting data into various 
formats for display. In one format, referred to as "spectrum view or retentate map," a 
standard spectral view can be displayed. The spectral view depicts the quantity of 
marker reaching the detector at each particular molecular weight. In another format, 
referred to as "peak map," only the peak height and mass information are retained 
from the spectrum view, yielding a cleaner image and enabling signals representing 
markers with nearly identical molecular weights to be more easily seen. In yet 
another format, referred to as "gel view," each mass from the peak view can be 
converted into a grayscale image based on the height of each peak, resulting in an 
appearance similar to bands on electrophoretic gels. In yet another format, referred to 
as "3-D overlays," several spectra can be overlaid to study subtle changes in relative 
peak heights. In yet another format, referred to as a "difference map view," two or 
more spectra can be compared, conveniently highlighting signals representing 
markers and signals representing markers that are up- or down-regulated between 
samples. Marker profiles (spectra) from any two samples may be compared visually 
on one plot. Data that can be used to form the data set may be obtained from these 
and other mass spectra display formats. 

II. Forming the data set 

Once the mass spectra are obtained, a data set such as a known data set 
is formed. The data set comprises data that is obtained from the mass spectra of the 
class set of biological samples. The mass spectra data forming the data set can be 
raw, unprocessed data. For example, raw signal intensity values at identified mass 
values from the mass spectra may be used to form the data set. In another example, 
raw signal patterns from mass spectra may be used to form the data set. 

In alternative embodiments, data may be preprocessed before it is used 
to form the classification model. The mass spectra may then be processed in any 
suitable manner before being used to form the classification model. For example, the 
signals in the mass spectra may be processed by taking the log values of the signal 
intensities, removing outliers, removing signals which are less likely to be associated 
with potential markers, removing signals which have low intensities, etc. 

In some embodiments, the data set may comprise raw or preprocessed 
pattern data that relates to the particular pattern of each mass spectrum. For example, 
for a mass spectrum comprising many signal peaks, the pattern of the signal peaks 
may constitute a fingerprint for the biological sample used to create the mass 
spectrum. The classification process can classify the different spectra according to 
patterns or pattern segments that may be common to the spectra in the respectively 
different classes differentiated by the classification model. A computer program such 
as a neural network program, for example, can receive plural mass spectra of known 
samples associated with known biological statuses. The neural network can be 
trained with the mass spectra data so that it can differentiate between mass spectra 
patterns belonging to the respectively different classes. The trained neural network 
can then be used to classify a mass spectrum associated with an unknown sample 
based on the pattern in the mass spectrum. 

In other embodiments, the data set comprises data relating to the 
intensities of the signals in the mass spectra. In these embodiments, some or all of the 
signals in each mass spectrum may be used to form the data set. For example, the 
intensities of less than all of the signals (e.g., peaks) in a spectra view type mass 
spectrum can be used to form the data set. In preferred embodiments, mass-to-charge 
ratios are identified, and the identified mass-to-charge ratios are used to select signals 
from the mass spectra. The intensities of these selected signals can be used to form 
the data set. By using data from less than all signals in each mass spectrum to form 
the data set, the number of data points that will be processed is reduced so that data 
processing occurs more rapidly. Data of signals that have a low likelihood of 
representing acceptable markers may be excluded from the data set. 
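
As a hedged illustration of selecting signals at identified mass-to-charge ratios, the sketch below builds a matrix with one row per mass spectrum and one column per identified ratio. The function name `build_data_set`, the representation of a spectrum as a pair of lists, and the relative tolerance of 0.3% are assumptions made for the example; no matching window is specified in this description.

```python
import numpy as np

def build_data_set(spectra, identified_mz, tolerance=0.003):
    """Return a (spectra x identified m/z) matrix of signal intensities.

    spectra: list of (mz_values, intensities) pairs of detected signals.
    tolerance: assumed relative m/z window around each identified ratio.
    Entries are NaN where a spectrum has no signal inside the window.
    """
    data = np.full((len(spectra), len(identified_mz)), np.nan)
    for row, (mz, intensity) in enumerate(spectra):
        mz = np.asarray(mz, dtype=float)
        intensity = np.asarray(intensity, dtype=float)
        for col, target in enumerate(identified_mz):
            in_window = np.abs(mz - target) <= tolerance * target
            if in_window.any():
                # Keep the strongest signal found inside the window.
                data[row, col] = intensity[in_window].max()
    return data
```

Signals outside every window are simply left out of the matrix, which is how the number of data points to be processed is reduced.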

Mass-to-charge ratios may be identified in any number of ways. For 
example, the mass-to-charge ratios may be identified by comparing the mass spectra 
of different classes having different biological statuses. The mass-to-charge ratios of 
signals that are likely to differentiate the classes may be selected. The comparison 
may be performed manually (e.g., by a visual comparison) or may be done 
automatically with a digital computer. For example, mass spectra associated with 
different classes of samples can be visually compared with each other to determine if 
the intensity of a signal at a mass-to-charge ratio in a mass spectrum from one sample 
class is significantly greater than or less than a signal at the same mass-to-charge ratio 
in a mass spectrum from a different sample class, thus indicating potential differential 
expression. Mass-to-charge ratios where these signal differences occur may be 
selected. 

FIG. 3, for example, shows a graph of log (2) normalized intensity vs. 
the identified peak clusters. This plot displays the log base 2 normalized intensity 
values. Each intensity value in a peak cluster has the average intensity value 
subtracted so a value of zero represents no change from the average. Each unit on the 
y-axis represents a two-fold difference from the cluster average. Significantly up and 
down regulated proteins can be identified using a plot such as the one shown in FIG. 
3. FIG. 3 shows a graph of log normalized intensity as a function of different signal 
clusters. The signal intensities from mass spectra from two different groups of 
samples are shown in the graph. For example, the peak cluster 22 (on the x-axis) in 
FIG. 3 shows a wide variation between the data points from Group A and Group B. 
This indicates that the mass-to-charge ratio associated with peak cluster 22 can be 
identified as a candidate marker location. 
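
The quantity plotted in a graph of this kind might be computed as sketched below. The sketch assumes that the average is taken over the log base 2 values within each peak cluster (the description does not say whether the average is taken before or after the logarithm), and the function names are hypothetical.

```python
import numpy as np

def log2_centered(intensities):
    """Log base 2 intensities with each cluster's average subtracted.

    intensities: (samples x peak clusters) array of positive normalized values.
    A result of 0 means no change from the cluster average; each unit is a
    two-fold difference.
    """
    logged = np.log2(np.asarray(intensities, dtype=float))
    return logged - logged.mean(axis=0, keepdims=True)

def group_separation(centered, is_group_a):
    """Difference between the Group A and Group B means for each cluster."""
    is_group_a = np.asarray(is_group_a, dtype=bool)
    return centered[is_group_a].mean(axis=0) - centered[~is_group_a].mean(axis=0)
```

A cluster whose group separation is far from zero, in either direction, would be a candidate marker location in the sense described above.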

Alternatively or additionally, certain predefined criteria may be 
provided to first select certain signals or signal clusters. The selected signal clusters 
may then be used to identify particular mass-to-charge ratios. For example, signals or 
signal clusters having a signal intensity or average signal intensity above or below a 
certain signal intensity threshold may be automatically selected. Mass-to-charge 
ratios associated with these selected signals or signal clusters may then be identified. 

Preferred methods including collecting mass spectra data, 
preprocessing the data, and processing the preprocessed mass spectral data to form a 
classification model can be described with reference to FIGS. 4 and 5. With reference 
to FIG. 4, mass spectra of samples associated with different biological statuses are 
collected (step 27). The number of samples collected is preferably large. For 
example, in embodiments of the invention, the number of collected samples may be 
from about 100 to about 1000 (or more or less than these values). Preferably, all 
samples used to create the spectra are created under similar conditions so that 
differences between the samples are reflected in the spectra. 

Signals corresponding to the presence of a potential marker are 
identified in each spectrum. Each such signal is assigned a mass value. Signals 
above a predetermined signal-to-noise ratio in each mass spectrum in the first group 
of mass spectra are then detected (step 28). In a typical example, signals with a 
signal-to-noise ratio greater than a value S may be detected. The value S may be an 
absolute or a relative value. Then, signals at the mass-to-charge ratios in the mass 
spectra are clustered together (step 30). Signal clusters that meet predetermined 
criteria are then selected. For example, in one embodiment, signal clusters having a 
predetermined number of signals can be selected (step 32). Clusters having less than 
the predetermined number are discarded. In a typical example, if the number of 
signals in a cluster is less than 50% of the number of mass spectra, then the signal 
cluster can be discarded. In some embodiments, the selection process results in 
anywhere from as few as about 20 to more than about 200 selected signal clusters. 
Once the signal clusters are selected, the mass-to-charge ratios for these signal 
clusters can be identified (step 34). 
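
Steps 28 through 34 might be sketched as follows. The clustering rule (chaining signals whose mass-to-charge ratios differ by no more than a relative tolerance), the tolerance of 0.3%, the default value of S, the use of the cluster mean as its mass-to-charge ratio, and the counting of distinct spectra rather than raw signals are all assumptions made for the example; the description does not fix these details.

```python
def detect_signals(mz, intensity, noise, s=5.0):
    """Keep signals whose signal-to-noise ratio exceeds S (step 28)."""
    return [m for m, i, n in zip(mz, intensity, noise) if n > 0 and i / n > s]

def cluster_signals(per_spectrum_mz, tolerance=0.003):
    """Group signals from all spectra that lie close together in m/z (step 30).

    Returns a list of clusters; each cluster is a list of (spectrum_index, mz).
    """
    tagged = sorted((m, idx) for idx, mzs in enumerate(per_spectrum_mz) for m in mzs)
    clusters = []
    for m, idx in tagged:
        # Extend the current cluster if this signal is near its last member.
        if clusters and m - clusters[-1][-1][1] <= tolerance * m:
            clusters[-1].append((idx, m))
        else:
            clusters.append([(idx, m)])
    return clusters

def select_clusters(clusters, n_spectra, min_fraction=0.5):
    """Discard clusters present in fewer than min_fraction of spectra (step 32)."""
    kept = []
    for cluster in clusters:
        n_present = len({idx for idx, _ in cluster})  # distinct spectra (assumption)
        if n_present >= min_fraction * n_spectra:
            kept.append(cluster)
    return kept

def cluster_mz(cluster):
    """Representative m/z of a cluster, taken here as the mean (step 34)."""
    return sum(m for _, m in cluster) / len(cluster)
```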

Once the mass-to-charge ratios are identified, "missing signals" for the 
mass-to-charge ratios can be determined. Some of the mass spectra may not exhibit a 
signal at the identified mass-to-charge ratios. This group of mass spectra or the 
samples associated with the mass spectra can be re-analyzed to determine if signals do 
in fact exist at the identified mass-to-charge ratios (step 36). Estimates are added for 
any missing signals (step 38). For spectra where no signal is found in a cluster, an 
intensity value is estimated from the trace height or noise value. The estimated 
intensity value may be user selectable. 
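
One simple way to add estimates for missing signals (step 38) is sketched below, assuming that missing entries are marked as NaN and that a single noise value is available for each spectrum. The function name and the `scale` parameter, which stands in for the user-selectable aspect of the estimate, are hypothetical.

```python
import numpy as np

def fill_missing_signals(data, noise_estimates, scale=1.0):
    """Replace NaN entries with an estimate derived from each spectrum's noise.

    data: (spectra x clusters) intensities with NaN where no signal was found.
    noise_estimates: one noise value per spectrum (per row).
    scale: user-selectable multiplier applied to the noise value (assumption).
    """
    data = np.array(data, dtype=float)  # copy so the caller's array is untouched
    noise = np.asarray(noise_estimates, dtype=float)[:, None]
    fill = np.broadcast_to(scale * noise, data.shape)
    missing = np.isnan(data)
    data[missing] = fill[missing]
    return data
```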

With reference to FIG. 5, once mass-to-charge ratios are identified, 
intensity values are determined for each signal at the identified mass values for all 
mass spectra (step 46). The intensity value for each of the signals is normalized from 
0 to 100 to remove the effects of absolute magnitude (step 48). Then, the logarithm 
(e.g., base 2) is taken for each normalized signal intensity (step 50). Taking the 
logarithm of the signal intensities removes skew from the measurements. 

The log normalized data set is then processed by a classification 
process (step 52) that is embodied by code that is executed by a digital computer. 
After the code is executed by the digital computer, the classification model is formed 
(step 54). Additional details about the formation of the classification model are 
provided below. 

III. Forming the Classification Model 

A classification process embodied by code that is executed by a digital 
computer can process the data set. The code can be executed by the digital computer 
to create a classification model. The code may be stored on any suitable computer 
readable media. Examples of computer readable media include magnetic, electronic, 
or optical disks, tapes, sticks, chips, etc. The code may also be written in any suitable 
computer programming language including C, C++, etc. 

The digital computer may be a micro, mini or large frame computer 
using any standard or specialized operating system such as a Windows™ based 
operating system. In other embodiments, the digital computer may simply be one 
or more microprocessors. The digital computer may be physically separate from the 
mass spectrometer used to create the mass spectra. Alternatively, the digital computer 
may be coupled to or physically incorporated into the mass spectrometer. Mass 
spectra data can be transmitted from the mass spectrometer to the digital computer 
manually or automatically. For example, in one embodiment, a known data set may 
first be obtained from a plurality of mass spectra. The known data set may then be 
manually entered into a digital computer running code that embodies a classification 
process. In another embodiment, the generation and/or collection of mass spectra 
data, the preprocessing of the data, and the processing of the preprocessed data by a 
classification process may be performed using the same physical computational 
apparatus. 

In some embodiments, the known data set can be characterized as a 
training set which can "train" a precursor to the classification model or a previously 
formed classification model. The classification model may be trained and learn as it 
is formed. For example, in a neural network, the known data set can be used to train 
the neural network to recognize differences between the classes of data that are 
entered into the neural network. After an initial classification model is formed, a 
larger number of samples can be used to further train and refine the classification 
model so that it can more accurately discriminate between the classes used to form the 
classification model. 

In embodiments of the invention, additional data may be used to form 
the classification model. The additional data may or may not relate to mass spectra. 
For instance, in some embodiments, pre-existing marker data may be used in addition 
to a known data set to form the classification model. For example, mass spectra for a 
class of prostate cancer patient samples and a class of non-prostate cancer patient 
samples may be obtained. A known data set may be formed using the mass spectra. 
A classification model may be formed using the known data set and pre-existing 
marker data such as pre-existing PSA diagnostic data (e.g., PSA clinical assay data). 
The additional pre-existing PSA diagnostic data can be used to help differentiate the 
mass spectra to form the classification model. For example, each mass spectrum may 
be evaluated to see if a signal at the mass-to-charge ratio corresponding to PSA is 
more closely associated with a signal intensity characteristic of prostate cancer or a 
signal intensity characteristic of non-prostate cancer. This information can be used to 
help assign the mass spectrum and its corresponding sample to a prostate cancer or a 
non-prostate cancer class. In other embodiments, non-mass spectra data such as the 
sex, age, etc. of the persons from which the biological samples were taken may also 
be used to form a classification model. For example, if men are more likely to have a 
particular disease than women, then this information can also be used to help classify 
samples and form a classification model. 

Any suitable classification process may be used in embodiments of the 
invention. For example, the classification process may be a hierarchical classification 
process such as a classification and regression tree process or a multivariate statistical 
analysis. A multivariate statistical analysis looks at patterns of relationships between 
several variables simultaneously. Examples of multivariate statistical analyses 
include well known processes such as discriminant function analysis and cluster 
analysis. Discriminant function analysis is a statistical method of assigning 
observations to groups based on previous observations from each group. Cluster 
analysis is a method of analysis that represents multivariate variation in data as a 
series of sets. In biology, for example, the sets are often constructed in a hierarchical 
manner and shown in the form of a tree-like diagram called a dendrogram. Some 
types of cluster analyses and other classification processes are described in the article 
by Jain et al., "Statistical Pattern Recognition: A Review", IEEE Transactions on 
Pattern Analysis and Machine Intelligence, Vol. 22, No. 1, January 2000. This article 
is incorporated herein by reference in its entirety. 

Alternatively, the classification process may use a non-linear 
classification process such as an artificial neural network analysis. An artificial 
neural network analysis can be trained using the known data set. In general, an 
artificial neural network can predict the value of an output variable based on input 
from several other input variables that can impact it. The prediction is made by 
selecting from a set of known patterns the one that appears most relevant in a 
particular situation. An artificial neural network conceptually has several neuron 
elements (units) and connections between them. These units are categorized into 
three different layers or groups according to their functions. A first group forms an 
input layer that receives the data entered into the system. A second group forms an 
output layer that delivers the output data representing an output pattern. A third group 
comprises a number of intermediate layers, also known as hidden layers, that convert 
the input pattern into an output. 

Illustratively, a neural network can be trained to differentiate between 
laser desorption mass spectra associated with a diseased state and a non-diseased 
state. Then, a mass spectrum of a test biological sample can be created by a laser 
desorption process and data relating to this mass spectrum can be input into the 
trained neural network. The trained neural network can determine if the test 
biological sample is associated with the diseased state or non-diseased state. 

In embodiments of the invention, the classification process preferably 
includes a hierarchical, recursive partitioning process such as a classification and 
regression tree process. In embodiments of the invention, the classification and 
regression tree process is embodied by computer code that can be executed by a 
digital computer. An exemplary classification and regression tree program is CART 
4.0 commercially available from Salford Systems, Inc. (www.salford-systems.com). 

One specific classification and regression tree process is a binary 
recursive partitioning process. The process is binary because parent nodes are always 
split into exactly two child nodes and recursive because the process can be repeated 
by treating each child node as a parent. To partition a known data set, questions are 
asked of the known data set. In embodiments of the invention, the data being 
partitioned are the mass spectra corresponding to the class set of biological samples. 
Each mass spectrum can be considered an "instance" to be classified. An exemplary 
question that may be used to partition the instances may be "Is the signal intensity of 
the signal at the mass-to-charge ratio X greater than Y?" Each question subdivides 
the known data set into two groups of more homogeneous composition. Once a best 
split is found, the classification and regression tree process repeats the search process 
for each child node, continuing recursively until further splitting is impossible or 
stopped. Splitting is impossible if only one case remains in a particular node or if all 
the cases in that node are of the same type. 
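
A minimal sketch of binary recursive partitioning is given below for illustration. It is not the CART 4.0 program; it assumes the Gini impurity as the measure of how homogeneous a group is (a measure commonly used by classification and regression tree programs, but not named in this description), and it omits pruning and other stopping rules. All function names are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0 means a pure group)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(rows, labels):
    """Find the question 'is feature X greater than Y?' giving the purest groups."""
    best = None  # (weighted impurity, feature index, threshold)
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            yes = [lab for row, lab in zip(rows, labels) if row[feature] > threshold]
            no = [lab for row, lab in zip(rows, labels) if row[feature] <= threshold]
            if not yes or not no:
                continue  # the question does not divide the node
            score = (len(yes) * gini(yes) + len(no) * gini(no)) / len(labels)
            if best is None or score < best[0]:
                best = (score, feature, threshold)
    return best

def grow_tree(rows, labels):
    """Split recursively until a node is pure, holds one case, or cannot be split."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or len(rows) == 1:
        return {"leaf": majority}
    split = best_split(rows, labels)
    if split is None:
        return {"leaf": majority}
    _, feature, threshold = split
    yes = [(r, l) for r, l in zip(rows, labels) if r[feature] > threshold]
    no = [(r, l) for r, l in zip(rows, labels) if r[feature] <= threshold]
    return {
        "feature": feature,
        "threshold": threshold,
        "yes": grow_tree([r for r, _ in yes], [l for _, l in yes]),
        "no": grow_tree([r for r, _ in no], [l for _, l in no]),
    }

def classify(tree, row):
    """Answer each node's question in turn until a leaf is reached."""
    while "leaf" not in tree:
        tree = tree["yes"] if row[tree["feature"]] > tree["threshold"] else tree["no"]
    return tree["leaf"]
```

Here each row would hold the log normalized signal intensities of one mass spectrum at the identified mass-to-charge ratios, and each label its known biological status.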

The questions asked of the data set may be determined by a user or 
may be automatically determined by a digital computer. In some embodiments, the 
questions can be arbitrarily generated by a digital computer and the quality of the data 
splitting determines if the question is acceptable. For example, a question may be 
asked of the data. If the partitioning results in a statistically significant split of the 
instances, the question may be kept and used to form the classification and regression 
tree. The classification and regression tree process identifies the optimal number of 
questions required to classify the data, compensating for the effects of random error in 
each sample observation. 

The classification and regression tree process looks at all possible 
splits for all predictor variables included in the analysis. For example, for a data set 
with 215 instances and 19 predictor variables, the process considers up to 215 times 
19 splits for a total of 4085 possible splits. Typically, all such splits are considered 
when forming a classification and regression tree. Consequently, the formed 
classification and regression tree process takes into account many different predictor 
variables in forming the classification model. For example, in a typical embodiment, 
data of signals at over 100 mass-to-charge ratios in all mass spectra for the class set 
are taken into account when forming the classification model. In comparison, the 
differential expression analysis described above takes only one predictor variable into 
account. Consequently, the classification and regression tree embodiments can 
provide more accurate classification than other classification methods since 
more data from each mass spectrum is used to form the classification model. 
To check the accuracy of the model, the classification and regression 
tree process may employ a computer-intensive technique called cross validation. In a 
typical cross-validation process, a large tree is grown and is then pruned back. The 
data set is divided into 10 roughly equal parts, each containing a similar distribution 
for the biological statuses being analyzed. The first 9 parts of the data are used to 
construct the largest possible tree. The remaining 1 part of data is used to obtain 
initial estimates of the error rate of selected sub-trees. The same process is then 
repeated (growing the largest possible tree) on another 9/10 of the data while using a 
different 1/10 part as the test sample. The process continues until each part of the 
data has been held in reserve one time as a test sample. The results of the 10 mini-test 
samples are then combined to form error rates for trees of each possible size. These 
error rates are applied to the tree based on the entire data set. Cross validation 
provides fairly reliable estimates of the independent predictive accuracy of the tree. 
Even if an independent test sample is not available, a prediction can be made as to 
how accurately the tree can classify completely fresh data (e.g., data from a plurality 
of unknown samples). 
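
The partitioning and hold-out portion of a 10-fold cross validation might be sketched as follows. This simplified illustration pools the test errors into a single error rate and does not reproduce the pruning of sub-trees described above. The function names and the `fit` and `predict` callables, which stand for any classification process, are assumptions made for the example.

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_folds=10, seed=0):
    """Assign instances to folds so each fold has a similar mix of classes."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled members of each class round-robin into the folds.
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds

def cross_validated_error(rows, labels, fit, predict, n_folds=10):
    """Hold each fold out once, train on the rest, and pool the test errors."""
    folds = stratified_folds(labels, n_folds)
    errors = 0
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(rows)) if i not in held]
        model = fit([rows[i] for i in train], [labels[i] for i in train])
        errors += sum(predict(model, rows[i]) != labels[i] for i in held_out)
    return errors / len(rows)
```

Every instance is used exactly once as test data and nine times as training data, so the pooled error rate estimates performance on data the model has not seen.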

The classification and regression tree that is created provides a 
representation of which of the predictor variables (if any) are responsible for the 
differences between sample groups. The classification and regression tree can be 
used for classification (predicting what group a case belongs to) and also be used for 
regression (predicting a specific value). It can also be used to identify features that 
may be important in discriminating between the classes being analyzed. For example, 
the classification model may indicate that one or more signal intensity values at 
specific mass-to-charge ratios, alone or in combination, are important features that 
differentiate the classes being analyzed. 

The classification and regression tree graphically displays the 
relationships found in data. One primary output of the classification and regression 
tree process is the tree itself. The tree can serve as one aspect of a classification 
model that can be visually analyzed by a user. Unlike non-linear techniques such as a 
neural network analysis, the visual presentation provided by the tree makes the 
classification analysis very easy to understand and assimilate. As a result, users tend 
to trust the results of decision trees more than they do "black box" classification 
models such as those characteristic of trained neural networks. This makes the 
classification and regression tree a desirable classification model for various health 
care and regulatory personnel (e.g., the Food and Drug Administration), and patients, 
who may want to have a detailed understanding of the analysis used to create the 
classification model. The trees can also be used to discover previously unknown 
connections between the data and the biological statuses being analyzed. 

The classification and regression tree process has other advantages
over classification processes such as a neural network analysis. For example,
classification and regression tree programs are more efficient than neural networks,
which typically require a large number of passes of the training set data, sometimes
numbering in the thousands. The number of passes required to build a decision tree,
however, is no more than the number of levels in the tree. There is no predetermined
limit to the number of levels in the tree, although the complexity of the tree as
measured by the depth and breadth of the tree generally increases as the number of
predictor variables increases.

Also, using the classification and regression tree model, features that
may discriminate between the classes may be identified. The identified features in the
data may be characteristic of the biological status(es) being analyzed. For example, the
classification model may indicate that a combination of features is associated with a
particular biological status. For example, the model may indicate that specific signal
intensities at different mass-to-charge ratios differentiate a diseased state from a
non-diseased state. In comparison to conventional differential analysis processes, in
embodiments of the invention, many different variables may be analyzed. The
classification model can identify a single predictor variable or can identify multiple
predictor variables that may differentiate the biological statuses being analyzed.

IV. Using the Classification Model

The classification model may be used to classify an unknown sample
into a biological status. In this method, the mass spectrum of a test sample can be
compared to the classification model associated with a particular biological status to
determine whether the sample can be properly classified with the biological status. A
mass spectrum of the unknown biological sample can be obtained, and data obtained
from a mass spectrum of the unknown sample can be entered into a digital computer.
The entered data may be processed using a classification model. The classification
model may then classify the unknown sample into a particular class. The class may
have a particular biological status associated with it, and the person can be diagnosed
as having that particular biological status.

This method has particular use for clinical applications. For example,
in the process of drug discovery, one may wish to determine whether a candidate
molecule produces the same physiological result as a particular drug or class of drugs
(e.g., the class of serotonin re-uptake inhibitors) in a biological system. A
classification model is first developed that discriminates biological systems based on
exposure to the drug or class of drugs of interest (e.g., persons or test animals). Then,
the biological system is exposed to the test molecule and a mass spectrum of a sample
from the system is produced. This spectrum is then classified as belonging or not
belonging to the classification of the known drug or group of drugs against which it is
being tested. If the candidate molecule is assigned to the class, this information is
useful in determining whether to perform further research on the drug.

In another application, a classification model is developed that
discriminates various toxic and non-toxic biological states. Toxic status can result
from, e.g., exposure to a drug or class of drugs. That is, a classification model can be
developed that indicates whether or not a drug or class of drugs produces a toxic
response in a biological system (e.g., in vivo or in vitro model systems including liver
toxicity). Then, a drug that is in development or in clinical trials can be tested on the
system to determine whether a spectrum from a sample from the system can be
classified as toxic or not. This information also is useful in toxicity studies during
drug development.

In another application, a classification model is developed that
discriminates between persons who are responders and non-responders to a particular
drug. Then, before giving a drug to a person who is not known to be a responder or
non-responder, a sample from the person is tested by mass spectrometry and assigned
to the class of responders or non-responders to the drug.

In another application, a classification model is developed that
distinguishes persons having a disease from those who do not have the disease. Then a
person undergoing diagnostic testing can submit a sample for classification into the
status of having the disease or not having the disease. Thus, this method is useful
for clinical diagnostics.

One embodiment is directed to analyzing cancer. Pathologists grade
cancers according to their histologic appearance. Features of low-grade cancers
include enlarged nuclei with a moderate increase in nuclear/cytoplasmic ratio, small
number of mitoses, moderate cytologic heterogeneity, and retention of generally
normal architecture. Features of high-grade cancers include enlarged, bizarre looking
nuclei with a high nuclear/cytoplasmic ratio; increased number of mitoses, some of
which may appear atypical; and little or no resemblance to normal architecture. It is
useful to develop a classification model that distinguishes biological samples coming
from un-diseased, low-grade cancer, and high-grade cancer states, since this diagnosis
often dictates therapeutic decisions and can also predict prognosis. The sample can be a
solid tissue biopsy or a fine needle aspirate of the suspected lesion. However, in
another embodiment, the samples can derive from more easily collected sources from
the group of individuals being tested, such as urine, blood or another body fluid. This
is particularly useful for cancers that secrete cells or proteins into these fluids, such as
bladder cancer, prostate cancer and breast cancer. Upon establishment of the
classification model for these states, the model can be used to classify a sample from a
person subject to diagnostic testing.

In another application, a classification model is
developed that discriminates between classes of individuals having a particular
physical or physiological trait that is not pathologic. Then, individuals not known to
have the trait can be classified by testing a sample from the individual and classifying
a spectrum into the class having the trait, or outside the class having the trait.

The classification model can also be used to estimate the likelihood
that an unknown sample is accurately classified as belonging to a class characterized
by a biological status. For instance, in a classification and regression tree, the
likelihood of potential misclassification can be determined. Illustratively, a
classification and regression tree model that differentiates a diseased state from a
non-diseased state classifies an unknown sample from a patient. The model can
estimate the likelihood of misclassification. If, for example, the likelihood of disease
misclassification is less than 10%, then the patient can be informed that there is a
greater than 90% chance that he has the disease.

V. Systems including computer readable media 

Some embodiments of the invention are directed to systems including a
computer readable medium. A block diagram of an exemplary system incorporating a
computer readable medium and a digital computer is shown in FIG. 6. The system 70
includes a mass spectrometer 72 coupled to a digital computer 74. A display 76 such
as a video display and a computer readable medium 78 may be operationally coupled
to the digital computer 74. The display 76 may be used for displaying output
produced by the digital computer 74. The computer readable medium 78 may be used
for storing instructions to be executed by the digital computer 74.

The mass spectrometer can be operably associated with the digital
computer 74 without being physically or electrically coupled to the digital computer
74. For example, data from the mass spectrometer could be obtained (as described
above) and then the data may be manually or automatically entered into the digital
computer 74 using a human operator. In other embodiments, the mass spectrometer
72 can automatically send data to the digital computer 74 where it can be processed.
For example, the mass spectrometer 72 can produce raw data (e.g., time-of-flight
data) from one or more biological samples. The data may then be sent to the digital
computer 74 where it may be pre-processed or processed. Instructions for processing
the data may be obtained from the computer readable medium 78. After the data from
the mass spectrometer is processed, an output may be produced and displayed on the
display 76.

The computer readable medium 78 may contain any suitable
instructions for processing the data from the mass spectrometer 72. For example, the
computer readable medium 78 may include computer code for entering data obtained
from a mass spectrum of an unknown biological sample into the digital computer 74.
The data may then be processed using a classification model. The classification
model may estimate the likelihood that the unknown sample is accurately classified
into a class characterized by a biological status.

Although the block diagram shows the mass spectrometer 72, digital
computer 74, display 76, and computer readable medium 78 in separate blocks, it is
understood that one or more of these components may be present in the same or
different housings. For example, in some embodiments, the digital computer 74 and
the computer readable medium 78 may be present in the same housing, while the
mass spectrometer 72 and the display 76 are in different housings. In yet other
embodiments, all of the components 72, 74, 76, 78 could be formed into a single unit.

EXAMPLE 

A plurality of mass spectra was generated from a set of biological
samples. The set included a first class of serum from normal
patients and a second class of serum from patients with prostate cancer. A serum
sample from each patient was run through a surface enhanced laser
desorption/ionization system commercially available from Ciphergen Biosystems,
Inc. of Fremont, California. Ciphergen Biosystems' ProteinChip® technology was
also used in this example. Additional details about ProteinChip® technology can be
found at the Website www.ciphergen.com. The resulting output for each sample was
a mass spectrum plot of signal intensity vs. mass-to-charge ratio. Discrete peaks
represented the signals in the mass spectra.

The intensities of the signals at the particular mass-to-charge ratios 
corresponded to the amount of proteins having the particular mass-to-charge ratios. 
For example, high signal intensities indicate high concentrations of proteins. Signals 
in each mass spectrum were located, quantified, and selected. In this example, 
segments of a mass spectrum were considered acceptable signals if they had intensity 
values at least twice as great as the surrounding noise level. Signals in the mass 
spectra at approximately the same mass-to-charge ratios were clustered together in all 
mass spectra. After clustering, about 250 signal clusters were identified and were
labeled P1 through P250. Each signal cluster, P1 through P250, corresponded to a
specific mass-to-charge ratio and was characterized as a "predictor variable". 
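
The signal selection and clustering steps just described can be sketched as follows.
This is a minimal illustration, assuming each spectrum is a list of (mass-to-charge,
intensity) peaks with one noise level per spectrum, and that peaks are grouped when
their mass-to-charge values differ by less than a relative tolerance. The 0.3%
tolerance, the function names, and the sample peak values are assumptions made for the
sketch; the example above does not state how the clustering window was chosen.

```python
from typing import Dict, List, Tuple

Peak = Tuple[float, float]  # (mass-to-charge ratio, intensity)


def select_signals(peaks: List[Peak], noise: float, min_ratio: float = 2.0) -> List[Peak]:
    """Keep peaks whose intensity is at least `min_ratio` times the noise level."""
    return [(mz, inten) for mz, inten in peaks if inten >= min_ratio * noise]


def cluster_by_mz(spectra: List[List[Peak]], rel_tol: float = 0.003) -> List[List[Peak]]:
    """Group peaks from all spectra whose m/z values lie close together."""
    all_peaks = sorted(p for spectrum in spectra for p in spectrum)
    clusters: List[List[Peak]] = []
    for peak in all_peaks:
        # Join the current cluster if the gap to its last peak is within tolerance;
        # otherwise start a new cluster.
        if clusters and peak[0] - clusters[-1][-1][0] <= rel_tol * peak[0]:
            clusters[-1].append(peak)
        else:
            clusters.append([peak])
    return clusters


# Hypothetical peaks; the noise level of 1.0 is an arbitrary relative unit.
spectrum_a = select_signals([(5075.0, 8.0), (6120.0, 1.5), (7300.0, 4.2)], noise=1.0)
spectrum_b = select_signals([(5080.0, 6.5), (7310.0, 3.9), (9000.0, 0.7)], noise=1.0)
clusters = cluster_by_mz([spectrum_a, spectrum_b])

# Label the clusters P1, P2, ... by increasing mean mass-to-charge ratio.
predictors: Dict[str, float] = {
    f"P{i + 1}": round(sum(mz for mz, _ in c) / len(c), 1)
    for i, c in enumerate(clusters)
}
```

Each resulting label plays the role of one predictor variable, and the intensity found
at that mass-to-charge ratio in a given spectrum is the value of the variable for that
spectrum.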

The signal intensities at the identified mass-to-charge ratios for each
mass spectrum formed the known data set. These signal intensities were entered into
a classification and regression tree program, CART 4.0, commercially available from
Salford Systems, Inc. (www.salford-systems.com). The program was executed by a
digital computer. The digital computer formed a classification and regression tree.
Using the data, each sample was classified as normal or cancer.

After the mass spectra data was input, the digital computer produced a
tree such as the one shown in FIG. 6. In this example, class 0 is normal while class 1
is cancer. Each mass spectrum can be characterized as an "instance" which is
classified in the tree.

Each box in the tree represents a "node". The top node, Node 1, is
called the root node. The decision tree grows from the root node, splitting the data at
each level to form new nodes. Branches connect the new nodes. Nodes that do not
experience further splitting are called terminal nodes. The terminal nodes in the tree
shown in FIG. 6 are labeled Terminal Nodes 1 to 7. As will be explained in further
detail below, Terminal Nodes 1 to 7 can be used to classify an unknown sample and
can thus be used for prediction.

In each node, the majority sets the classification for the entire node.
For example, Terminal Node 1 has four patients. Of these four patients, all four
patients have cancer. Terminal Node 1 is therefore characterized as a cancer node.
Because all instances have the same value (cancer), this node is characterized as
"pure" and will not be split further. If Terminal Node 1 included three cancer patients
and one normal patient, the node would still be characterized as a cancer node since a
majority of the patients are cancer patients. In this example, the one normal patient
would be considered incorrectly classified.
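
The majority rule for a node can be illustrated with a short Python sketch. The
function name and the return layout are assumptions made for this illustration only.

```python
from collections import Counter
from typing import List, Tuple


def node_class(labels: List[str]) -> Tuple[str, int, bool]:
    """Return (majority class, number misclassified, whether the node is pure)."""
    counts = Counter(labels)
    # most_common(1) returns the label with the highest count; on a tie it
    # returns the label that was seen first.
    majority, majority_count = counts.most_common(1)[0]
    misclassified = len(labels) - majority_count
    return majority, misclassified, misclassified == 0


# Terminal Node 1 as described: four cancer patients, hence a pure cancer node.
pure_node = node_class(["cancer"] * 4)

# The hypothetical variant: three cancer patients and one normal patient.
mixed_node = node_class(["cancer"] * 3 + ["normal"])
```

In the mixed case the node is still labeled cancer, and the single normal patient is
counted as one misclassified instance.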

In FIG. 6, each node contains information about the number of
instances at that node, and about the distribution of the biological status, cancer. The
instances at the root node (Node 1) are all of the instances in the mass spectra data set.
Node 1 contains 194 instances, of which 96 are normal and 98 are cancer. Node 1 is
split into two new nodes, Node 2 and Node 5. The data split is determined by
determining whether the average signal intensity for the cluster P127 is less than or
equal to 3.2946. The average signal intensities, as well as the value 3.2946, were on a
relative scale. If the answer to this question is yes, then the corresponding instances
are placed in Node 2. If the answer to this question is no, then the corresponding
instances are placed in Node 5. In this example, the mass spectra of 85 cancer
patients and 11 normal patients had a signal intensity less than or equal to 3.2946 at
the mass-to-charge ratio associated with the predictor variable P127 and were placed
in Node 2. The mass spectra of 85 normal patients and 13 cancer patients had a signal
intensity greater than 3.2946 at the mass-to-charge ratio associated with the predictor
variable P127 and were placed in Node 5. Similar partitioning using different
splitting rules occurred at the other nodes to form the tree.
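
One way to quantify how well the P127 split separates the classes is to compare node
impurity before and after the split. The sketch below uses the Gini impurity, which is
an assumption made for illustration; the text does not state which splitting criterion
the program applied in this example. Only the class counts are taken from the
description above.

```python
from typing import Tuple

Counts = Tuple[int, int]  # (number of normal instances, number of cancer instances)


def gini(counts: Counts) -> float:
    """Gini impurity of a node: 0.0 for a pure node, 0.5 for an even two-class mix."""
    total = sum(counts)
    return 1.0 - sum((n / total) ** 2 for n in counts)


root: Counts = (96, 98)    # Node 1
node2: Counts = (11, 85)   # P127 <= 3.2946
node5: Counts = (85, 13)   # P127 >  3.2946

n_root = sum(root)
# Impurity of the children, weighted by the share of instances each receives.
weighted_children = (sum(node2) / n_root) * gini(node2) \
    + (sum(node5) / n_root) * gini(node5)

# A larger decrease in impurity indicates a more useful split.
improvement = gini(root) - weighted_children
```

The root node is almost evenly mixed, while each child node is dominated by one class,
so the impurity decreases substantially after the split.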

The prediction performance of the classification and regression tree
can be described with reference to Tables 1 and 2.



Table 1 - Misclassification for Learn Data

Class         N Cases    N Misclassified    Percent Error
0 (Normal)    96         0                  0
1 (Cancer)    98         0                  0


Table 2 - Misclassification for Test Data

Class         N Cases    N Misclassified    Percent Error
0 (Normal)    96         9                  9.38
1 (Cancer)    98         11                 11.22



The classification and regression tree program divided the known data set into two
groups. About 90% of the data was used as a learning set and about 10% was used as
a test set. A classification and regression tree was initially formed using the learning set
data. After the tree was formed, it was tested with the remaining 10% test data to see
how accurately the classification and regression tree classifies data. With reference to
Table 1, all of the learning set data was correctly classified using the formed
classification and regression tree. With reference to Table 2, the percent error rates
for classifying the normal case and the cancer case test data were 9.38% and 11.22%,
respectively. Conversely, the classification success rate was 90.62% and 88.78% for
the normal cases and the cancer cases, respectively.
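
The percentages above follow directly from the case counts in Table 2, as the short
sketch below shows. The function and variable names are illustrative only.

```python
def percent_error(n_cases: int, n_misclassified: int) -> float:
    """Percentage of cases in a class that the tree assigned to the wrong class."""
    return 100.0 * n_misclassified / n_cases


# (N Cases, N Misclassified) for the test data, taken from Table 2.
test_results = {"normal": (96, 9), "cancer": (98, 11)}

error = {name: percent_error(n, m) for name, (n, m) in test_results.items()}
# The success rate is the complement of the error rate.
success = {name: 100.0 - e for name, e in error.items()}
```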

Classification success rates such as these indicate that the classification
and regression tree is a highly accurate model for classifying unknown biological
samples. In the classification process, multiple predictor variables are considered in
the classification scheme. Much more data can be used from a mass spectrum to
classify the sample associated with the mass spectrum than in the previously described
differential analysis procedure, which only uses average signal intensities at a single
mass-to-charge ratio to classify a test patient. Accordingly, the classification model
can be more accurate in classifying a test patient than many conventional
classification models.

Once grown, the tree can be used to classify an unknown sample by
starting at the root (top) of the tree and following a path down the branches until a
terminal node is encountered. The path is determined by imposing the split rules on
the values of the predictor variables in the mass spectrum for the unknown sample.
For example, if a mass spectrum of an unknown serum sample from a test patient has
signals with intensities of 1.0, 0.05, and 0.9 at the mass-to-charge ratios of predictor
variables P127, P193, and P187 respectively, then the test patient would be classified
in Node 1, Node 2, Node 3, and then finally Terminal Node 1. Terminal Node 1 is a
cancer node and the patient would be classified as being a cancer patient.
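
The traversal just described can be sketched in Python. Only the root split (P127 less
than or equal to 3.2946) and the path Node 1, Node 2, Node 3, Terminal Node 1 come from
the example. The thresholds for P193 and P187, the "less than or equal" direction at
those nodes, and the stub branches are hypothetical placeholders chosen so that the
sketch reproduces the stated path; they are not values from the actual tree.

```python
from typing import Any, Dict, List, Optional, Tuple

Node = Dict[str, Any]


def stub(name: str) -> Node:
    """Placeholder for a branch whose contents are not described in the text."""
    return {"name": name, "label": None}


TREE: Node = {
    "name": "Node 1", "variable": "P127", "threshold": 3.2946,
    "yes": {
        "name": "Node 2", "variable": "P193", "threshold": 0.5,      # hypothetical
        "yes": {
            "name": "Node 3", "variable": "P187", "threshold": 1.0,  # hypothetical
            "yes": {"name": "Terminal Node 1", "label": "cancer"},
            "no": stub("undescribed branch of Node 3"),
        },
        "no": stub("undescribed branch of Node 2"),
    },
    "no": stub("Node 5 (subtree not described)"),
}


def classify(tree: Node, sample: Dict[str, float]) -> Tuple[List[str], Optional[str]]:
    """Follow the split rules from the root until a node carrying a label is reached."""
    node = tree
    path = [node["name"]]
    while "label" not in node:
        # "yes" branch when the intensity satisfies the split rule, else "no".
        branch = "yes" if sample[node["variable"]] <= node["threshold"] else "no"
        node = node[branch]
        path.append(node["name"])
    return path, node["label"]


unknown = {"P127": 1.0, "P193": 0.05, "P187": 0.9}
path, label = classify(TREE, unknown)
```

Representing each node as a small dictionary keeps the split rule next to its two
branches, so the path taken by a sample can be reported alongside its final class.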

FIG. 7 shows a table of variable importance of each of some of the
predictor variables (e.g., signal clusters). The variable importance table ranks the
predictor variables by how useful they were in building the classification and
regression tree. If a specific predictor variable strongly differentiates the mass spectra
data, then it is important in building the classification tree. To calculate a variable
importance score, CART looks at the improvement measure attributable to each
variable in its role as a surrogate to a primary split. The values of these improvements
are summed over each node and totaled, and are scaled relative to the best performing
variable. The variable with the highest sum of improvements is scored 100, and all
other variables will have a lower score ranging downwards towards zero.
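
The scaling step can be illustrated as follows. This sketch covers only the summing and
rescaling described above; it does not compute surrogate-split improvements, and the
variable names and improvement values are hypothetical.

```python
from typing import Dict, List


def importance_scores(improvements: Dict[str, List[float]]) -> Dict[str, float]:
    """Sum per-node improvements for each variable and scale the best one to 100."""
    totals = {var: sum(values) for var, values in improvements.items()}
    best = max(totals.values())
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return {var: round(100.0 * total / best, 2) for var, total in ranked}


# Hypothetical improvement values, one entry per node where the variable contributed.
raw = {
    "X1": [0.20, 0.05],
    "X2": [0.10, 0.10, 0.025],
    "X3": [0.05],
}
scores = importance_scores(raw)
```

Because every score is expressed relative to the best performing variable, the scores
rank the variables but are not comparable between different trees.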

In FIG. 7, the classification model indicates that the predictor variables
P36, P127, and P90 are more important than other predictor variables in forming the
classification and regression tree. They are consequently more important than other
predictor variables in discriminating between the classes, cancer and non-cancer. The
mass-to-charge ratios associated with these predictor variables are also associated
with potential markers that differentiate prostate cancer samples from non-prostate
cancer samples. Accordingly, the classification model can be used to identify one or
more markers that may discriminate between classes being analyzed.

The effectiveness of the tree model can be confirmed with reference to
FIGS. 8 and 9. The views in FIG. 8 are gel views while the views in FIG. 9 are trace
views. The spectra are zoomed into the signal represented by P127 at a
mass-to-charge ratio of 5075 daltons (charge = +1). FIGS. 8 and 9 show that markers
in samples from six prostate cancer patients and six non-prostate cancer patients are
differentially expressed at the mass value of 5075 daltons corresponding to the
predictor variable P127. As shown in the tree in FIG. 6, the predictor variable P127 is
the first node in the tree. Also, as shown in FIG. 7, the predictor variable P127 was
shown to be more effective in differentiating the prostate cancer class of samples from
the non-prostate cancer patient class of samples than most other predictor variables.

While the foregoing is directed to certain preferred embodiments of
the present invention, other and further embodiments of the invention may be devised
without departing from the basic scope of the invention. Such alternative
embodiments are intended to be included within the scope of the present invention.
Moreover, the features of one or more embodiments of the invention may be
combined with one or more features of other embodiments of the invention without
departing from the scope of the invention.

All publications (e.g., Websites) and patent documents cited in this
application are incorporated by reference in their entirety for all purposes to the same
extent as if each individual publication or patent document were so individually
denoted. By their citation of various references in this document, Applicants do not
admit that any particular reference is "prior art" to their invention.

WHAT IS CLAIMED IS:

1. A method that analyzes mass spectra using a digital computer, the method
comprising:

a) entering into a digital computer a data set obtained from mass spectra
from a plurality of samples, wherein each sample is, or is to be assigned to a class
within a class set comprising two or more classes, each class characterized by a
different biological status, and wherein each mass spectrum comprises data
representing signal strength as a function of time-of-flight, mass-to-charge ratio, or a
value derived from time-of-flight or mass-to-charge ratio, and is created using a laser
ionization desorption process; and

b) forming a classification model which discriminates between the classes
in the class set, wherein forming comprises analyzing the data set by executing code
that embodies a classification process.

2. The method of claim 1 wherein the mass spectra are selected from the group 
consisting of MALDI spectra, surface enhanced laser desorption/ionization spectra, 
and electrospray ionization spectra. 

3. The method of claim 1 wherein the class set consists of exactly two classes.

4. The method of claim 1 wherein the samples comprise biomolecules selected
from the group consisting of polypeptides and nucleic acids.

5. The method of claim 1 wherein the samples are derived from a eukaryote, a
prokaryote or a virus.

6. The method of claim 1 wherein the different biological statuses comprise a
normal status and a pathological status.

7. The method of claim 1 where the different biological statuses comprise
un-diseased, low grade cancer and high grade cancer.

8. The method of claim 1 wherein the different biological statuses comprise a
drug treated state and a non-drug treated state.

9. The method of claim 1 wherein the different biological statuses comprise a
drug-responder state and a drug-non-responder state.

10. The method of claim 1 wherein the different biological statuses comprise a 
toxic state and a non-toxic state. 

11. The method of claim 10 wherein the toxic state results from exposure to a
drug.

12. The method of claim 1 wherein the data set is a known data set, and each
sample is assigned to one of the classes before the data set is entered into the digital
computer.

13. The method of claim 1 wherein forming the classification model comprises
using pre-existing marker data to form the classification model.

14. The method of claim 1 wherein the data set is formed by:

detecting signals in the mass spectra, each mass spectrum comprising
data representing signal strength as a function of mass-to-charge ratio;

clustering the signals having similar mass-to-charge ratios into signal
clusters;

selecting signal clusters having at least a predetermined number of
signals with signal intensities above a predetermined value;

identifying the mass-to-charge ratios corresponding to the selected 
signal clusters; and 

forming the data set using signal intensities at the identified 
mass-to-charge ratios. 

15. The method of claim 1 wherein forming the classification model comprises at
least one of identifying features that discriminate between the different biological
statuses, and learning.

16. The method of claim 1 wherein the classification process comprises a neural
network analysis.

17. The method of claim 1 further comprising:

c) interrogating the classification model to determine if one or more
features discriminate between the different biological statuses.

18. The method of claim 1 further comprising:

c) repeating a) and b) using a larger plurality of samples.

19. The method of claim 1 wherein the classification process is a cluster analysis.

20. The method of claim 1 further comprising forming the data set, wherein
forming the data set comprises obtaining raw data from the mass spectra and then
preprocessing the raw mass spectra data to form the data set.

21. The method of claim 1 wherein the different classes are selected from
exposure to a drug, exposure to one of a class of drugs and lack of exposure to a drug
or one of a class of drugs.

22. The method of claim 1 wherein each mass spectrum comprises data
representing signal strength as a function of mass-to-charge ratio or a value derived from
mass-to-charge ratio.

23. A method for classifying an unknown sample into a class characterized by a
biological status using a digital computer, the method comprising:

a) entering data obtained from a mass spectrum of the unknown sample
into a digital computer; and

b) processing the mass spectrum data using the classification model
formed by the method of claim 1 to classify the unknown sample in a class
characterized by a biological status.

23. The method of claim 23 wherein the class is characterized by a disease status.

24. The method of claim 23 wherein the different biological statuses comprise
un-diseased, low grade cancer and high grade cancer.

25. The method of claim 23 wherein the class is characterized by exposure to a
drug or one of a class of drugs.

26. The method of claim 23 wherein the class is characterized by response to a
drug.

27. The method of claim 23 wherein the class is characterized by a toxicity status. 

28. A method for estimating the likelihood that an unknown sample is accurately
classified as belonging to a class characterized by a biological status using a digital
computer, the method comprising:

a) entering data obtained from a mass spectrum of the unknown sample
into a digital computer; and

b) processing the mass spectrum data using the classification model
formed by the method of claim 1 to estimate the likelihood that the unknown sample
is accurately classified into a class characterized by a biological status.

29. A computer readable medium comprising:

a) code for entering data obtained from a mass spectrum of an unknown
sample into a digital computer; and

b) code for processing the mass spectrum data using the classification
model formed by the method of claim 1 to classify the unknown sample in a class
characterized by a biological status.

30. A system comprising:

a gas phase ion spectrometer;

a digital computer adapted to process data from the gas phase ion
spectrometer; and

the computer readable medium of claim 29 in operative association with the
digital computer.

31. The system of claim 30 wherein the gas phase ion spectrometer is adapted to
perform a laser desorption ionization process.

32. A computer readable medium comprising:

a) code for entering data obtained from a mass spectrum of an unknown
sample into a digital computer; and

b) code for processing the mass spectrum data using the classification
model formed by the method of claim 1 to estimate the likelihood that the unknown
sample is accurately classified into a class characterized by a biological status.

33. A system comprising:

a gas phase ion spectrometer;

a digital computer adapted to process data from the gas phase ion
spectrometer; and

the computer readable medium of claim 32 in operative association with the
digital computer.

34. The system of claim 33 wherein the gas phase ion spectrometer is adapted to
perform a laser desorption ionization process.

35. A computer readable medium comprising:

a) code for entering data derived from mass spectra from a plurality of
samples, wherein each sample is, or is to be assigned to a class within a class set of
two or more classes, each class characterized by a different biological status, and
wherein each mass spectrum comprises data representing signal strength as a function
of time-of-flight, mass-to-charge ratio or a value derived from mass-to-charge ratio or
time-of-flight, and is created using a laser desorption ionization process; and

b) code for forming a classification model using a classification process, wherein
the classification model discriminates between the classes in the class set.

36. The computer readable medium of claim 35 wherein the classification process
comprises a neural network analysis.

37. A system comprising:

a gas phase ion spectrometer;

a digital computer adapted to process data from the gas phase ion
spectrometer; and

the computer readable medium of claim 35 in operative association with the
digital computer.

38. The system of claim 37 wherein the gas phase ion spectrometer is adapted to
perform a laser desorption ionization process.






WO 03/031031    PCT/US01/44972



[FIG. 2 (sheet 2/10): flowchart]

16  Collect time-of-flight spectra
18  Apply smoothing filter
20  Calculate baseline
22  Apply TOF-mass transformation
24  Calculate local noise values
26  Spectra update complete



[FIG. 3 (sheet 3/10): plot; vertical axis from -1.5 to 1.5; horizontal axis PEAK
CLUSTER, ticks from 10 to 40; series GROUP A and GROUP B]



[FIG. 4 (sheet 4/10): flowchart]

27  Collect mass spectra of samples associated with different biological traits
28  Detect signals above a predetermined S/N ratio
30  Cluster signals with similar mass values
32  Select signal clusters with peaks in more than N spectra
34  Identify mass values for selected signal clusters
36  Detect targeted signals
38  Add estimates for missing signals



[FIG. 5 (sheet 5/10): flowchart]

46  Determine signal intensities at identified mass values for all spectra
48  Normalize signal intensities
    Calculate logs of the signal intensities
    Process data using classification process
54  Create classification model



[FIG. 6 (sheet 6/10): block diagram of system 70: mass spectrometer, digital
computer 74, display 76, computer readable medium 78]



[Sheet 7/10: classification and regression tree diagram; node labels are not legible
in the extracted text]



8/10 

FIG. 8. VARIABLE IMPORTANCE (the bar beside each score is omitted) 

VARIABLE    SCORE 
P36         100.00 
P127         95.89 
P90          93.52 
P185         91.57 
P128         76.95 
P119         63.80 
P73          26.10 
P21          18.18 
P193         17.93 
P50          17.30 
P94          16.84 
P163         13.89 
P158         13.89 
P190         13.89 
P15          13.32 
P164         11.77 
P135         11.61 
P16          10.01 
P55           9.89 
P250          9.89 
P54           9.89 
P11           9.89 
P187          9.72 
P209          9.69 
P46           6.91 
P1            6.58 
P78           6.58 
P18           3.43 
P6            3.43 
P4            3.43 
P19           0.00 
P2            0.00 
P5            0.00 
P32           0.00 
P3            0.00 
P35           0.00 
P23           0.00 



9/10 

FIG. 9. GEL VIEW OF P127. Horizontal axis ticks: 5000, 5050, 5100, 5150, 5200. 
Annotation: "Peak at 5075 Da corresponds to P127 used as the first node in CART". 
Spectrum labels (each followed by L220): 23989, 23990, 24028, 24033, 24034, 
24035, 4655, 15782, 10113, 4744, 4778, 5068. 



10/10 

FIG. 10. TRACE/SPECTRUM VIEW OF P127. Horizontal axis ticks: 5000, 5050, 5100, 
5150, 5200; vertical axis of each trace: 0 to 20. 
Annotation: "Peak at 5075 Da corresponds to P127 used as the first node in CART". 
Spectrum labels (each followed by L220): 23989, 23990, 24028, 24033, 24034, 
24035, 4655, 15782, 10113, 4744, 4778, 5068. 



INTERNATIONAL SEARCH REPORT 

International application No. 
PCT/US01/44972 

A. CLASSIFICATION OF SUBJECT MATTER 

IPC(7): B01D 59/44; H01J 49/00 
US CL: 250/282, 281 

According to International Patent Classification (IPC) or to both national classification and IPC 

B. FIELDS SEARCHED 

Minimum documentation searched (classification system followed by classification symbols) 

U.S.: 250/282, 281 

Documentation searched other than minimum documentation to the extent that such documents are included in the fields 
searched 

Electronic data base consulted during the international search (name of data base and, where practicable, search terms used) 
NONE 



C. DOCUMENTS CONSIDERED TO BE RELEVANT 

Category*   Citation of document, with indication, where appropriate,        Relevant to 
            of the relevant passages                                         claim No. 

A,E         US 6,329,652 B1 (WINDIG et al) 11 December 2001 (11.12.2001),    1-23, 23-38 
            Figs. SA and SB. 

A           US 4,122,343 A (RISBY et al) 24 October 1978 (24.10.1978),       1-23, 23-38 
            col. 4, line 60 - col. 5, line 52. 

Further documents are listed in the continuation of Box C.    See patent family annex. 



* Special categories of cited documents: 

"A" document defining the general state of the art which is not 
    considered to be of particular relevance 

"E" earlier document published on or after the international filing date 

"L" document which may throw doubts on priority claim(s) or which is 
    cited to establish the publication date of another citation or other 
    special reason (as specified) 

"O" document referring to an oral disclosure, use, exhibition or other 
    means 

"P" document published prior to the international filing date but later 
    than the priority date claimed 

"T" later document published after the international filing date or priority 
    date and not in conflict with the application but cited to understand 
    the principle or theory underlying the invention 

"X" document of particular relevance; the claimed invention cannot be 
    considered novel or cannot be considered to involve an inventive step 
    when the document is taken alone 

"Y" document of particular relevance; the claimed invention cannot be 
    considered to involve an inventive step when the document is 
    combined with one or more other such documents, such combination 
    being obvious to a person skilled in the art 

"&" document member of the same patent family 



Date of the actual completion of the international search 

16 MAY 2002 

Date of mailing of the international search report 

Name and mailing address of the ISA/US 
Commissioner of Patents and Trademarks 
Box PCT 
Washington, D.C. 20231 
Facsimile No. (703) 305-3230 

Authorized officer 

DAVID VANORE 
Telephone No. (703) 306-0246 

Form PCT/ISA/210 (second sheet) (July 1998)* 



